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Abstract 

Automated sensing instruments on satellites and aircraft have enabled the collection 
of massive amounts of high-resolution observations of spatial fields over large spatial 
regions. If these datasets can be efficiently exploited, they can provide new insights 
on a wide variety of issues. However, traditional spatial-statistical techniques such as 
kriging are not computationally feasible for big datasets. We propose a multi-resolution 
approximation (Af-RA) of Gaussian processes observed at irregular locations in space. 
The M -RA process is specified as a linear combination of basis functions at multiple 
levels of spatial resolution, which can capture spatial structure from very fine to very 
large scales. The basis functions are automatically chosen to approximate a given co- 
variance function, which can be nonstationary. All computations involving the Af-RA, 
including parameter inference and prediction, are highly scalable for massive datasets. 
Crucially, the inference algorithms can also be parallelized to take full advantage of 
large distributed-memory computing environments. In comparisons using simulated 
data and a large satellite dataset, the Af-RA outperforms a related state-of-the-art 
method. 
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1 Introduction 

Automated sensing instruments on satellites and aircraft have enabled the collection of mas¬ 
sive amounts of high-resolution observations of spatial fields over large and inhomogenous 
spatial domains. If these kinds of datasets can be efficiently exploited, they can provide 
new insights on a wide variety of issues, such as greenhouse gas concentrations for climate 
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change, soil properties for precision agriculture, and atmospheric states for weather forecast¬ 
ing. Based (implicitly or explicitly) on Gaussian processes, the held of spatial statistics pro¬ 
vides a rich toolkit for the analysis of such data, including estimating unknown parameters, 
predicting the spatial held at unobserved locations, and properly quantifying uncertainty in 
the predictions and parameters (e.g., Cressie and Wikle, 2011). 

However, traditional spatial-statistical techniques such as kriging are not computationally 
feasible for big datasets, because dense n x n matrices need to be decomposed, where n is the 
number of measurements. General-purpose methods such as the preconditioned conjugate 
gradient algorithm (e.g., Golub and Van Loan, 2012) or probabilistic projections (e.g., Halko 
et ah, 2011; Stein et ah, 2013) can solve large linear systems. But it is challenging for 
such algorithms to calculate the determinant of the data covariance matrix also required for 
likelihood-based inference, as this covariance is huge, often dense, and might have a slowly 
decaying spectrum. 

More specialized methods for approximating spatial inference that explicitly try to exploit 
spatial information in the data have been proposed in the literature, but most of them either 
require restrictive assumptions about the covariance function (e.g., Fuentes, 2007; Lindgren 
et al., 2011), or they ignore much of the fine-scale dependence (e.g., Higdon, 1998; Mardia 
et ah, 1998; Calder, 2007; Cressie and Johannesson, 2008; Katzfuss and Cressie, 2009; Lernos 
and Sanso, 2009; Katzfuss and Cressie, 2011, 2012) or the large-scale dependence (e.g., Furrer 
et ah, 2006; Kaufman et ah, 2008; Shaby and Ruppert, 2012). Composite-likelihood methods 
(e.g., Vecchia, 1988; Curriero and Lele, 1999; Stein et ah, 2004; Caragea and Smith, 2007; 
Bevilacqua et ah, 2012; Eidsvik et ah, 2014) achieve computational feasibility by treating 
(blocks of) observations as conditionally independent, but it is not clear how to obtain proper 
joint predictive distributions for locations in different blocks. 

We propose here a multi-resolution approximation (M-RA) of Gaussian processes ob¬ 
served at irregular (i.e., non-griclded) locations in space. The M-RA process is specified as 
a linear combination of basis functions at multiple levels of spatial resolution, which can 
capture spatial structure from very fine to very large scales. Multi-resolution models (e.g. 
Chui, 1992; Johannesson et ah, 2007; Nychka et ah, 2015) have been very successful in spatial 
statistics, clue to their ability to flexibly capture dependence at multiple spatial scales while 
being computationally feasible. In constrast to these existing methods, in our M-RA the 
basis functions and the distributions of their weights are chosen to “optimally” approximate 
a given covariance function, without requiring any restrictions on this covariance function. 
The basis functions in each region at each resolution are chosen iteratively according to the 
rules of the predictive process (Quinonero Candela and Rasmussen, 2005; Banerjee et ah, 
2008), based on a recursive partitioning of the spatial domain into smaller and smaller sub- 
regions, and a set of “knot” locations in each region. The M-RA is a generalization of the 
full-scale approximation (Snelson and Ghahramani, 2007; Sang et ah, 2011; Sang and Huang, 
2012), a current state-of-the-art method for covariance approximations for large spatial data, 
which has only one level of domain partitioning and one resolution of basis functions. We 
will compare the two approaches extensively. 

Inference for basis-function models essentially consists of obtaining the posterior dis¬ 
tribution of the basis-function weights. In previous approaches, computationally feasible 
inference has been achieved by limiting the number of basis functions to be small (e.g., Hig¬ 
don, 1998; Quinonero Candela and Rasmussen, 2005; Cressie and Johannesson, 2008) or by 
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specifying the precision matrix of the weight vector to be diagonal or sparse (e.g., Higdon, 
1998; Lindgren et ah, 2011; Nychka et al., 2015). The M-RA combines both approaches: 
It results in a multi-resolution (block) sparse precision matrix, but the number of spatial 
basis functions within each region is small, allowing repeated application of the Sherman- 
Morrison-Woodbury formula (Sherman and Morrison, 1950; Woodbury, 1950). This leads 
to a highly scalable inference algorithm for the M-RA (which could also be applied to any 
multi-resolution basis-function model with a similar structure). Crucially, based on previous 
work (Katzfuss and Hammerling, 2014) describing parallel algorithms for a special case of 
the M-RA, we derive algorithms that can split the computing task efficiently between many 
nodes. This way, spatial inference could be carried out for massive spatial datasets, using 
parallel computations at a number of nodes each dealing only with a subset of the dataset. If 
enough computing nodes are available, this ensures scalability of the M-RA even for datasets 
with many millions of observations. 

This article is organized as follows. In Section 2, we introduce the M-RA and discuss 
some of its properties. In Section 3, we present algorithms for parameter inference and 
spatial prediction, and describe the computational complexity of the M-RA. In Section 4, 
we apply the M-RA to large simulated datasets and to a real-data example and compare 
the M-RA to the full-scale approximation. We conclude in Section 5. All proofs are given 
in Appendix A. 

2 Multi-resolution approximation (M-RA) 

We begin this section by describing the true Gaussian process to be approximated (Section 
2.1). After some preliminaries (Section 2.2), we introduce the multi-resolution approximation 
(M-RA) (Section 2.3) and discuss its properties (Sections 2.4 and 2.5). 

2.1 The true Gaussian process 

Let (|/o( s ) : s £ M, or Uo('), be the true spatial held or process of interest on a continuous 
(non-gridded) domain D C M d , d G N + . We assume that y o(-) ~ GP(0,Go) is a zero- 
mean Gaussian process with covariance function Co- We place no restrictions on Go, other 
than assuming that it is a positive-definite function on T> that is known up to a vector of 
parameters, 0. In real applications, yo(-) will often not have mean zero, but estimating and 
subtracting the mean is usually not a computational problem. Once yo(-) has been observed 
at a set of n spatial locations, S = {si,..., s n }, the distribution of the data is given by 

yo(«S) := ( 2 / 0 (si),..., 2 /o(s re )) / ~ N n (0,C 0 (S,S)), 

an n-variate Gaussian distribution with covariance matrix Co(S,S) = (Co(sj,Sj)). n . 

The basic goal of spatial statistics is to make (likelihood-based) inference on the param¬ 
eters 6 and to obtain spatial predictions of y 0 (-) at a set of locations S p (i.e., to obtain the 
posterior distribution of y 0 (<S p )). This requires multiple Cholesky decompositions of the 
data covariance matrix Cq(S,S), which generally has 0(n 3 ) time and 0(n 2 ) memory com¬ 
plexity. This is computationally infeasible when n = 10 5 or more. In addition, computations 
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for inference are also very difficult to parallelize, as computations with a dense and large 
covariance matrix require substantial communication overhead. Thus, the computational 
challenges cannot be solved by brute-force use of high-performance computing systems, and 
approximations or simplifying assumptions are necessary. 

2.2 Domain partitioning and knots 

To define the M- RA, we need a recursive partitioning of the spatial domain T>, in which 
each of J regions, , Vj, is again divided into J smaller subregions, and so forth, up to 

level M: 


Uj TO =l,..., J > jli ■ ■ ■ 1 jm — 1 1 1 • ■ ■ 1 J i m l,...,Tf, 

For a generic Gaussian process x(-) ~ GP( 0, C), we define [x(-)][ m ] to be a “block-independent” 
version of x(-) between regions at resolution m; that is, [x(-)][ m ] ~ GP( 0, [C ][ m ]), where 
[C] [m] (s i , s 2 ) = C(s 1 ,s 2 ) if S!,s 2 are in the same region at resolution m, and 

[C'][m](si,s 2 ) = 0 otherwise. 

We also need a multi-resolutional set of knots, such that Qj u ...,j m is a set of r knots (with 
r « n ) that all he in subregion For ease of notation, we assume that the knots 

in each of the regions at resolution M are given by the observation locations in that region: 

Qj != S h,-,3M- Further, we write Q (m) = { Q h . Jm : j u ..., j m = 1,..., J} for the set of 

all rJ m knots at resolution m. For a one-dimensional toy example, the top row of Figure 1 
shows partitions and knots for M-RAs with M — 1 and M — 3. The knots are the locations 
at which the basis functions attain their maximum. 

Note that we have assumed the same number (J) of subregions in each partition and 
the same number (r) of knots in each subregion, but this is only for notational convenience 
and not necessary for the inference described later. Hereafter, we will assume the domain 
partitioning and knots to be fixed and known. Some further discussion of their choice is 
given in Section 2.5. 

2.3 Definition of the multi-resolution approximation (M-RA) 

Recall the true Gaussian process yo(-) ~ GP( 0, Go) from Section 2.1. We assume temporarily 
that the covariance function Go is fully known (parameter inference is discussed in Section 
3.3). The M-RA process approximates yo(-) and its covariance G 0 iteratively at resolutions 
m = 0,..., M, based on the knots and partitions specified in Section 2.2. At each res¬ 
olution m, it approximates the “remainder” term — the difference between y 0 (•) and the 
approximations at lower resolutions — using the predictive process (Banerjee et ah, 2008), 
independently between regions As illustrated in Panel (d) of Figure 1, low M-RA 

resolutions captures variability at low frequencies (i.e., at large distances), resulting in re¬ 
mainder terms that exhibit variability on smaller and smaller scales as m increases, and so 
approximating the remainder independently between finer and finer partitions causes little 
approximation error. 

More specifically, we begin with a predictive-process approximation of yo(-), To(s) : = 
G(y 0 (s)|yo(Q (0) )), s 6 D, and we approximate the remainder process by assuming it to be 
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(a) Simulated data (red dots), and J = 9 partitions 
(red vertical lines) and r = 6 basis functions (blue) 
for a 1-RA 



(b) Data (red dots), and J = 3 partitions (vertical 
lines) and r = 2 basis functions per region for a 
3-RA 



(c) True covariance and 1-RA up to resolution m 


(d) True covariance and M-RA up to resolution m 



(e) Posterior predictive means (solid lines) and pointwise 95% posterior prediction intervals (dashed lines) 
for y(-). The interval bounded by the gray vertical lines is “unobserved.” 

Figure 1: Comparison of a full-scale approximation (1-RA) to a multi-resolution approximation with 3 
resolutions (3-RA) with the same computational complexity (Mr = 6 for both models) in a toy example of 
n = 54 observations generated from an exponential covariance function on a one-dimensional spatial domain 
V = [0,1]. Panels (c) and (d) show the covariance approximations using only the basis functions up to 
resolution m, for m = 0,..., M. As m increases, deviations between the true and approximated covariance 
occur only on smaller and smaller scales (distances). For m = M, the covariance and predictions of the 
3-RA are exactly the same as the true covariance and the corresponding (optimal) predictions of the 0-RA, 
whereas the covariance approximations and predictions using the 1-RA differ considerably from the truth. 
Note that for other covariance functions or in higher dimensions, the M-RA will not generally be exact (see 
Section 2.5). 
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independent between regions V i,... ,Vj at resolution 1: (b(-) = [ 2 / 0 (*) — r o(‘)][i] (Sang et al., 
2011). We then again approximate this remainder process as the sum of its predictive process, 
T\ (s) = s G V, plus the approximate remainder ^ 2 (-) = [<5i(-) — (•)][ 2 ], 

and so forth, up to level M. This leads to the following expression for the M- RA: 


Um(') — To(-) + Ti(-) + ... + tm-i(-) + (1) 

where r m (s) = F(S m (s)jS m (Q (m) )), s G V, and S m (-) = [<5 m _i(-) - r m - i(-)][m] ~ GP(0,v m ). 

An alternative expression for the M -RA in (1) can be obtained by noting that, for 
m — 0,..., M — 1, we can write each predictive process for s G as a linear com¬ 
bination of basis functions (cf., Katzfuss, 2013), r m (s) = with 'Hj 1: ...j rn ~ 

Ay(0, . and 

bj'l,... jm( S ) - = V m { s ) S £ Pjl,...,jm 

• = Qjl,-,jm) (2) 

Ati+i( s 1 ) s 2 ) : = ^m(Sl, S 2 ) - bji j ...j m (Si) / K J - li ...j rri b Jli ... Jm (s 2 ), Si, S 2 G 

where u m+ i(si,s 2 ) = 0 if Si and s 2 are in different regions at the mth resolution, and we set 
v 0 = Cq. Therefore, the M -RA can also be written as a linear combination of basis functions 
at M resolutions 0,..., M — 1, plus a remainder term at resolution M: 

y M { s) = b(s)'r/ + + ... + ly ; . j, v ; (s)'7/ ;i . JXI , + S M { s), s G V n . jxr (3) 

where the weight vectors are independent of each other and of Sm(-) ~ GP( 0, %)■ Panels (a) 
and (c) of Figure 1 show the basis functions in the toy example. Once we have observed data 
at locations S, we can also write the remainder Sm(-) in (3) as a linear combination of basis 
functions, S M { s) = b„.. /A ,is)'■//, : . j,,. s G %.y u , as in (2), where 

2.4 Properties of the M -RA 
Many basis functions 

In contrast to so-called low-rank approaches, which rely on a small or moderate number of 
basis functions and for which capturing small-scale variation is challenging (see, e.g., Stein, 
2014), the total number of basis functions for the M -RA with M > 0 is larger than the 
number of measurements: r tota i = r X^m=o = r ^ M + r Em=o = n + r> n - 
allows the M -RA to capture variation at all spatial scales, including very small scales. 

Orthogonal decomposition 

Because the predictive process is a conditional expectation, which is a projection operator, 
the predictive process r m (-) is independent of the remainder S m (-) — r m (-), for all m = 
0,..., M — 1. Hence, we define the M-RA in (1) as a sum of orthogonal components. In 
the form (3), the M-RA is a weighted sum of spatial basis functions, for which the weights 
are block-orthogonal in probability space, but two sets of basis functions b ;i ... ;Jm| 
and bj^...^ are only block-orthogonal in physical space if H = 0. 
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(a) True covariance matrix 



(b) First 3 resolutions of basis functions and partitions (dashed lines) 


Figure 2: For a spatial process with a nonstationary Matern covariance function with range 0.15 and spatially 
varying (increasing) smoothness i/(s) = 0.2+0.7s on a one-dimensional domain T> = [0,1], basis functions and 
partitions of a M -RA with J = 2 and r = 3. Note how the basis functions adapt to the increasing smoothness 
of the true covariance function and according to the placement of basis functions at lower resolutions. 


Valid Gaussian process 

Proposition 1. The M-RA is a valid Gaussian process with a nonnegative definite covari¬ 
ance function, Cm ■ 

“Optimal” basis functions 

At every resolution m = 0,..., M — 1 and within every region the goal of the M-RA 

in (1) is to approximate the remainder process S m (-) as closely as possible, where 

MO = [Mi(0 - An-i(0]M = bo(0 - EZ'o 1 r(0]h- (4) 

Hence, in each region, S m (-) in (1) is the difference between the true process t/ 0 (-) and the 
“previous” terms at lower resolutions, Ea'=Z t(')- We choose r m (-) to be the predictive- 
process approximation of S m (-). As this is a conditional expectation, r m (-) is the function of 
^m(Q^) that minimizes the expected squared difference to 5 m (s), conditional on S m (Q^) 
(Banerjee et ah, 2008). Further, r m (-) can be viewed as an approximation of the optimal 
rank-r representation of S m (-) within each region in that r m (-) is the Nystrom 

approximation of the first r terms in the Karhunen-Loeve expansion of S m (-) (Sang and 
Huang, 2012). This is further evidence that at each resolution the predictive process captures 
variability at the low frequencies, leaving mostly higher-frequency variability to be captured 
at higher resolutions within smaller subregions. For increasing r, r m (-) will be increasingly 
close to S m (-). In fact, if Qj 1 ,...j rn is equal to S v .... i3rn , the set of observed locations in 
then it is straightforward to show that = this sense, r m (-) (and 

its basis-function representation) are an “optimal” approximation of 

In contrast to many other multi-resolution methods for spatial data, the M-RA thus 
automatically provides a basis-function representation to approximate a given covariance 
function Co (based on a particular domain partitioning and set of knots), without any re¬ 
strictions on Co- This is illustrated in Figure 2, which shows the basis functions of a 3-RA 
for a highly nonstationary covariance function Cq. 
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Quality of the covariance approximation 

At resolution m, the M-RA attempts to capture the covariance of the remainder S m (•) 
between the partitions of each region 77,^ by the predictive-process basis-functions com¬ 
ponent How close the covariance of the M-RA, (7 m(si, s 2 ), is to the true 

covariance, Co(si,s 2 ), depends on up to which resolution Si and s 2 lie in the same region. 
If Si and s 2 are in the same region at resolution M, then Cm{ Si, s 2 ) = Co(si, s 2 ). (To prove 
this, simply combine (1) with (4).) This also implies that the variances of yo(-) and Um{-) 
are the same. If Si and s 2 are in the same region T> JX at resolution m < M , but not at 
resolution m + 1, then C' 0 (si, s 2 ) is only approximated by the basis functions up to resolution 
m: C M { Si,s 2 ) = E/™o b ji,-ji( s i) ,K ii,-ji b ii,-ji( s 2)- 

Comparison to the full-scale approximation 

Important special cases of the M-RA are the original process yo for M = 0, and the full-scale 
approximation (Snelson and Ghahramani, 2007; Sang et ah, 2011) for M = 1. The full-scale 
approximation, or 1-RA, only has basis functions at one resolution with r F knots Q F , and a 
single level of domain partitioning, 77 = (J ?=| jF 77 ). For massive datasets, the subregions 

77) need to be very small to maintain computational feasibility. 

Comparing our M-RA (with M > 1) and a full-scale approximation (1-RA) with Q h = Q 
and {77) : j = 1,..., J F } = {77^ J M : j 1 ,... ,j M = 1,..., J}, the covariance approximation 
for the two models is the same for pairs of locations in the same finest subregion 77) and 
for pairs in different subregions 77; and 77 j with i ^ j at the (coarsest) resolution 1. For all 
other pairs of locations, the M-RA has extra basis functions to capture their dependence, 
and if r is sufficiently large that r m (-) captures the dependence of <5 m (-) between subregions 
77 ji .....j rn+l well, the M-RA will provide a better approximation of yo(-) than the 1-RA. 

As described later in Section 3.6, the M-RA with r knots has the same computational 
complexity as the 1-RA with Mr knots. As illustrated in Figure 1, the M-RA can result in 
considerably better approximations. Further comparisons are presented in Section 4. 

2.5 More on the choice of knots and partitions 

To achieve good approximations, we recommend choosing M and J as small and r as large as 
the computational resources allow (see later in Table la), under the constraint that rJ M > n. 

If the observation locations are approximately uniformly distributed over the domain 77, 
the partitions can simply be obtained by recursively splitting each region into J subregions of 
equal area. If the observation locations are far from uniform, more complicated partitioning 
schemes might be necessary to achieve fast inference. 

The remaining issue is the placement of the r knots within each region. A simple so¬ 
lution is to use equidistant grids over each region 77^ Jm , but it can also be advanta¬ 
geous to place more knots close to the boundaries within each region. To see why, re¬ 
member from (4) and Section 2.3 that the goal within each region 77^ is to approx¬ 
imate S m (•) ~ GP(0,v m ). Consider the case of a region with J = 2 subregions con¬ 
taining observed locations S = {du,<S 2 } with Sj = where Sf are the locations 

within a distance c from the boundary and Sj are the remaining locations in the interior 
of subregion j. Choosing the knots = S B := it can be shown that 

var(S m (S )) = var(T m (S )) + uar(5 m (d>)|<5. m (5 B )), the latter being a matrix with only one 



nonzero block, var( 6 m (S I )\ 6 rn (S B )). The only part of this matrix that is ignored by the 
M-RA is cov( 6 m (S{), S m (S 2 )\ 8 m (S B )), which should be very small if c is large and/or the 
screening effect (e.g., Stein, 2011) holds for v m . 

An extreme case of this strategy is illustrated in Figure 1. For the exponential covariance 
function without nugget in one spatial dimension, the screening effect holds exactly, in that 
two observations are conditionally independent given a third observation that separates the 
two. Because the knots for a particular resolution in Figure 1 are placed on the boundaries 
between partitions at the next higher resolution, the M-RA is exact in this case. For co- 
variance functions without screening effect or in higher dimensions, the M-RA will generally 
not be exact. 

While the (favorable) numerical results in Section 4.1 are obtained with the simplest 
choice of equal-area partitions and equally spaced knots, it is possible to adopt more compli¬ 
cated strategies, such as choosing the knots and partitions based on clustering (e.g., Snelson 
and Ghahramani, 2007) or using reversible-jump Markov chain Monte Carlo (e.g., Gramacy 
and Lee, 2008; Katzfuss, 2013). Any potential boundary effects due to the choice of parti¬ 
tions can be alleviated by carrying out several M-RAs with different, shifted partitions and 
combining the results using Bayesian model averaging (e.g., Hoeting et ah, 1999). 


3 Inference 

In this section, we describe inference for the M-RA. For a particular value of the param¬ 
eter vector 6 , the covariance function Cq , and hence the basis functions band the 
covariance matrices Kin (3) are fixed. The prerequisite for inference is to calculate 
the quantities summarizing the prior distribution induced by the M-RA at the chosen knots 
and observed locations (Section 3.1). Then, the main task for inference is to obtain the 
posterior distribution of the unknown weight vectors £m-i (Section 3.2), where we define 
£ m := {? 7 j: ji ,..., ji — 1,..., J; l — 0,..., m} for all m — 0,..., M — 1 to be the set of 
all basis-function weights at resolution m and all lower resolutions (and we let £_i = 0 be the 
empty set). Once this posterior distribution has been obtained, it can be used to evaluate 
the likelihood (Section 3.3) and to obtain spatial predictions (Section 3.4). By exploiting 
the block-sparse multi-resolution structure of the prior and posterior precision matrices of 
the weights, we obtain inference that has excellent time and memory complexity (Section 
3.6), can take full advantage of distributed-memory systems with a large number of nodes 
(Section 3.5), and is thus scalable to massive spatial datasets. 


3.1 Calculating the prior quantities 

Let S 


be the observation locations that lie in region T > n and define 


B L 

• • ijm 

■— h/'ivbiW.h,- 

Six, 

• • ? jm 

:= var(y M (<S jl 

Vix, 

• • ijm 

:= var(y M {S h 


, m, 


( 5 ) 


for m = 0,1,..., M — 1, and := 
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For inference, we explicitly need to calculate the matrices {Kd - m : ji ,..., j m = 1,..., J; m 
0, ...,M-1}, = 1,..., J;Z = 0, and Ji,..., j M = 


, J}. Defining jm := vi{Q 3i we can do so by calculating 


W 


= a>(Q,w ro , 




( 6 ) 


for m = 0,..., M, j u ..., j m = 1,..., J, and l = 0,..., m. Then we have K.J jm = W™ 

for m < M, ., A/ = W* lt i JM for / < M, and £.„. JA , = W),'... JA ,- 

As an aside, other parameterizations of these matrices (and the quantities in (3)) are also 
possible and will lead to similar inference algorithms as described later, as long as the weight 
vectors are a priori independent and the basis functions have the same limited support. 


3.2 The posterior distribution of the basis-function weights 

The definition of the M-RA in (3), together with the definitions in (5), imply that 


yM{Sj lt ...j m )\S m ~ dV (^ i=0 Vju—jm)- 


Using the results in Katzfuss and Hammerling (2014, Sect. 3), it can be shown that the 
conditional posterior distributions of the weight vectors for all m = 0,..., M — 1 are given 
by 

r 1ji r --,jm\y m(S)i £m-l ~ dV r (lZj 1) Kj 1) jl, ■ ■ ■ , jm = 1, ■ ■ • , J, (7) 

with posterior precision and mean 


K = K n-,jm + B n,-,jJV B™,... Jm = KT| + A-- m , 

U jl,-Jm = ^-31,-Jm{ B jl,...,j m ^jl....,jm ^ M (^31 ,- ,3m ) _ Hl =0 B jl ^ J^jl .... ,jl ) ) 


( 8 ) 


= K 


, .111 

i< .. j 


. A m,< n 

Z-^l= 0 


m,l 


respectively, where 


A}- 1 , 

:= 

, 'Vt 1 

, B' , 

_ v-'vj 

Jl 

3 lv 

■•■>Jm Jl 

-ijm Jl vjm 

U3 k 

31,-,3m 

: = B i, 

. 'V7 1 

-,3m 31,- 


"dm) = Sj m+1 = l 


(9) 


can be obtained recursively for m — M - 1, M — 2,..., 0 using 


A k / , 

:= B a 

, 'S' 1 , B' 

JlvJm 

ji.- 

Jlf'iJm Jlv 

31,-,3m 

: = B L 

..,c. 


= 


Cl> 


Jl, 


\ k ’ m K ■ A m ’ 1 

jlr-jm hlr-JmA'lvJm’ 

_ A k ’ m ic , 


( 10 ) 


In practice, the quantities in (10) are calculated directly from the definition (first equality) 
for m = M , and using the recursive expression (second equality) for m — M — 1,..., 0. The 
proof of results (9)-(10) is straightforward using the Sherman-Morrison-Woodbury formula 
(Sherman and Morrison, 1950; Woodbury, 1950; Henderson and Searle, 1981): 


S" 1 • 

31,-,3m 


v . 1 . _ V 1 ■ 15"' K • IV" . 'V • 1 ■ 

31,-,3m Jl,-,3m 31,-,3m 31, —,3m Jl,-,Jm Jl, —,3m' 


( 11 ) 
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3.3 Parameter inference 

Inference for the M -RA is based on Lm{0), the likelihood of the observations y m{<S) ~ 
N n ( 0,S), where S = X(0) is the M -RA covariance matrix given in (5) with m = 0 based 
on parameter vector 6 . 


-2 log L m (0) = log |S| +y M (5) / S 1 y M {S). 

This likelihood can be calculated using the quantities in Section 3.2. We have —2 log Lm( 0) = 
d + u, with 


^31, ■,3m 


m — 0,1,..., M. 


For m = M, these quantities are calculated using the definition. For m — M — 1,..., 0, the 
recursive expressions 


U 3 l, — , 3 m 


loglKjt. .jJ - log|K-‘...jJ + Ei +1 =i d h . 


■Jm+1 ) 


-(jJ 


K 

JlvJm 
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+ E. 


J 


jm+ 1 = 1 U 3 U---, 3 m+l 1 


can be derived using a matrix determinant lemma (e.g., Harvillc, 1997, Thm. 18.1.1) and 
the Sherman-Morrison-Woodbury formula in (11). 

In summary, the M -RA log-likelihood can be written as a sum of log-determinants and 
quadratic forms involving only rxr matrices. This result enables fast and scalable evaluation 
of the likelihood, which in turn allows for a wide array of likelihood-based inference techniques 
for an unknown parameter vector 6 , such as maximum-likelihood estimation, Markov chain 
Monte Carlo, or particle filtering in spatio-temporal contexts. 


3.4 Spatial prediction 

Spatial prediction can be carried out separately, after parameter inference is completed. In 
a frequentist context, prediction only has to be carried out once, for the final parameter 
estimates. In a Bayesian framework, parameter inference can be carried out only for the 
thinned MCMC chain, or for particles with considerable weight in the case of a particle 
sampler. 

Implicitly conditioning on a particular value of the parameter vector 0, spatial prediction 
amounts to hireling the posterior predictive distribution, yM(<S p )\yM(<S), at a set of predic¬ 
tion locations S p . We denote by the prediction locations in region As a 

first step, we need to calculate prior prediction quantities similar to (6), 


U 


i 

jl,-, 3 m 


■■= MSL.3M’ . m) = CoQn ,...„) - EUuJ I ...^K ili ...^W} 1> .„ J /, 


for l = 0,..., M, and then we set := r. u (S' . Jai .S j: . , A( ) = U™ Jm and 


V 


: = = Co(S'i. jM ,S'; . Ja; ) - Eto 


3 l,—, 3 M 


■ XJ k ■ ' 

■A Ilr-JM ' 


Spatial predictions can then be obtained using the following proposition. 
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Figure 3: Illustration of the computational setup for distributed inference in the 2-RA with J = 2 and r = 5 
for a toy example in a one-dimensional spatial domain. As indicated by matching colors, node 
works with the knots jm = 1,2; in = 0,1,2. Only communication between connected nodes is 

necessary. 


Proposition 2. The posterior predictive distribution can be written as 

yv( s L.jJ y.v/(s) = + 

where 


( 12 ) 


r!,-, , ‘~JV r (K fl , a>” , ,K,-. , ), 

ind. 


M 


Si .. ~ .*,),vj.« -.. „ 


and £/ie “posterior basis-function matrices” are given by 

pWjfc . l_. / pP \ T / — 1 T}/c _ & TD^+lji TV" A 

(13) 

for k — 0,1,..., l — 1. 

Hence, the posterior predictive distribution in (12) has the same form as the (prior) M-RA 
process in (3). This allows calculation and storage of the joint posterior predictive distri¬ 
bution. Often, interest is in summaries of this joint posterior predictive distribution, such 
as the marginal posterior predictive distributions at each prediction location. In practice, 
the posterior basis-function matrices in (13) are calculated directly from the definition (first 
equation) for l = M, and using the recursive relation (second equation) for l — M — 1,..., 0. 


3.5 Distributed computing 

A major advantage of the M-RA is that it is well suited to modern computing environments, 
in that computations can be carried out in a distributed fashion with little communication 
overhead at a large number of nodes, each only dealing with a small subset of the data. 

More specifically, assume that we have nodes {ff'■ ju ■ ■ ■, jm = 1 m = 

0,1,..., M} in a tree-like structure, as illustrated in Figure 3. Each node A r j 1 ,...,j TH holds the 
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Single Processor 
Time Memory 

Distributed 
Time Memory 

0-RA 

n 6 

n 2 



1-RA 

2 

nr 

nr 

r 3 

r 2 

M-RA 

n(Mr ) 2 

nMr 

(Mr) 3 

Mr 2 


(a) Increasing r and M 



Single Processor 

Distributed 


Time 

Memory 

Time Memory 

M-RA 

n log 2 n 

nlogn 

log a n log n 


(b) M = O(logn) 


Table 1: Time and memory complexity of the M-RA and its special cases, regular kriging (0-RA) and 
full-scale approximation (1-RA), on a single computer and in the distributed setting of Section 3.5 


r knots or observation locations located in subregion and it only has to 

work with matrices of size r x r (implying excellent load balance). The main computational 
effort for node Af n is in computing the Cholesky decomposition of the r x r matrix 
and calculating the quantities in (10), the latter of which could be parallelized if 
the node has multiple cores. The communication to each node is 0(Jm 2 r 2 ), as it 

receives the matrices to calculate the quantities in (9) from its children. The calculations at 
the nodes/subregions for each resolution can be carried out completely in parallel. 

For spatial prediction at locations S p , each (terminal) node carries out parallel 

computations involving S p - M , the prediction locations in region to obtain the 

“posterior basis-function matrices” in (13). 

3.6 Computational complexity 

Remember that we assume here for simplicity that there is an equal number of r = -jt knots 
or observation locations in each region T> Ji Jrn , and we regard J as a fixed (small) number. 
For each of the J m < 2 J M regions, the main computational effort for inference is in 

obtaining the matrices /m in (10), which requires one Cholesky decomposition of the 

r x r matrix and computing 0(m 2 ) quadratic forms of size r x r. Thus, M-RA 

inference has 0(J M M 2 r 3 ) = 0(nr 2 M 2 ) time complexity and 0(nrM ) memory complexity. 

When the M-RA is implemented in a distributed environment with a large number of 
nodes (as in Section 3.5 above), the overall time complexity is 0(M 3 r 3 ) and the memory 
complexity per node is 0(Mr 2 ), assuming that communication (which is 0(M 2 r 2 ) per node) 
does not dominate computation time. 

Thus, the M-RA with r knots has the same computational complexity as the 1-RA 
with Mr knots, but the M-RA can provide a much better approximation (see Figure 1). 
As is further explored in Section 4.1 below, as n is increasing, the performance of the 1- 
RA degrades unless r is allowed to increase as some fraction of n, while for the M-RA we 
can keep r fixed and instead let M increase with n as M = O{logjn). This is a natural 
assumption under increasing-domain asymptotics, for which an increase in the domain and 
data size by a factor of J allows an additional split of the resulting domain (i.e., increasing M 
by one) without degrading the approximation within the J subregions at the first resolution. 
In this case, the time and memory complexity for the M-RA are 0{n\og 2 n) and 0(n\ogn), 
respectively, in the non-distributed setting. In the distributed setting, the overall time 
complexity is then 0(log 3 n), and the memory and communication complexity per node are 
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0 {\ogn) and 0 (\og 2 n) } respectively. 

The same complexities hold for prediction, as long as the number of prediction locations 
per terminal region is on the same order as the number of observed locations (i.e., Uf | = 

0(r)). In addition, the M-RA allows us to store the entire joint predictive distribution 
in 0(A4r 2 J M ) = 0(nMr) memory ( 0(Mr 2 ) per node in the distributed case). If M = 
(9(logn), this is (9(nlogn) (or 0 (\ogn) per node). 

As summarized in Table 1(b), if we let M increase as logn, the time and memory com¬ 
plexity for the M-RA are both quasilinear in n, and even polylogarithmic in distributed 
settings with many nodes. Hence, the M-RA is highly scalable and can handle truly massive 
spatial datasets if enough computational nodes are available. 

4 Numerical comparisons and illustrations 

Using simulated and real data, we compared our proposed M-RA to the full-scale ap¬ 
proximation (1-RA), which is a special case of the M-RA and a current state-of-the-art 
method for large spatial data. A non-distributed implementation of the methods in Julia 
(http: //julialang. org/) version 0.3.7 was run on a 16-core machine (Intel Xeon 2.90GHz) 
with 64GB RAM. All Julia code, R code to produce the plots, and the data for Section 4.2 
are available as supplementary material. 

4.1 Simulation study 

We simulated five datasets, each roughly of size 2 million (specifically, n max = 1,966,080) 
from a Gaussian process with mean zero and covariance function 

Co(si, S 2 ) = 0.95 Ati.5(|si — S 2 1/0.05) + 0.05 /(si = S 2 ), si, S 2 £ T), 

where /(•) is the indicator function and 

M. 1 . 5 (h) — (l + hVs) exp ( — hV3), h £ Mq (14) 

is a Matern correlation function with smoothness parameter 1.5, which is also used for the 
real-data example below in Section 4.2. The data were simulated on an equidistant grid on a 
one-dimensional domain T> = [0,1], which permitted fast simulation using the Davies-Harte 
algorithm and evaluation of the exact likelihood using the Durbin-Levinson algorithm for 
comparison. These algorithms were implemented in Julia along the lines of the functions 
DHsimulate and DLLoglikelihood in the R package ltsa (McLeod et al., 2007). 

From the “full” datasets of size n max , we created datasets of varying sample sizes n roughly 
between 2,000 and 2 million. We considered the two types of asymptotics commonly used 
in spatial statistics: For fixed-domain (infill) asymptotics, we took equally spaced subsets of 
size n from the full dataset on the entire domain [0,1]. For increasing-domain asymptotics, 
we always took the first n observations from the entire set. 

We then recorded the loglikclihood and the time taken to compute it for each n and for 
each of the following four competitors: 
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(a) Loglikelihood under fixed domain 



• observed 
A estimated 



0-RA 

1-RAS 

1-RAF 

M-RA 


(c) Times for likelihood evaluations 



(b) Loglikelihood under increasing domain 
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0.979 
T3 0.957 
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o 
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0.700 , , , , , , , 
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time (min) 

(d) Comparison for fixed n=l,966,080 



Figure 4: Results from the simulation study in one spatial dimension described in Section 4.1. 0-RA is 
the true Gaussian process, and 1-RA S and 1-RA F are full-scale approximations with increasing and fixed 
r, respectively. Note that all axes indicating time or sample size are on a log scale. The log-scores (i.e., 
loglikelihoods) are all scaled relative to the log-score of the M-RA. 


0-RA: A Gaussian process with the true covariance function Go, which provides the best 
possible fit, but scales as 0 (n 3 ). 

1-RA F: A “fast” 1-RA with fixed r = 240 and increasing J = n/240, which scales as 0(n). 

1-RA S: A “slow” 1-RA with fixed J = 64 and increasing r = n/ 64, which scales as 0(n 3 ). 

M-RA: A M-RA as described in Section 3.6, with r = 30, J = 4 and M = log 4 (n/30), 
which scales as G(nlog 2 n). 

For all competitors, the true covariance function (including all parameters) is assumed 
known, and we use the loglikelihood (at the true parameters) implied by each competitor as 
a measure of how well that competitor approximates this true covariance. The loglikelihood 
is equivalent to the log-score, which is a strictly proper scoring rule in the sense that it is 
maximized in expectation by the true model (e.g., Gneiting and Katzfuss, 2014). This means 
that, on average, the 0-RA will have the highest possible log-score. 

The results of these experiments with increasing sample size (averaged over the five 
datasets) are shown in Panels (a)-(c) of Figure 4. The computation times scale roughly as 
expected. We extrapolated the computation times of the 0-RA and 1-RA S for values of n 
for which the simulation machine ran out of memory, but were able to compute the exact 
loglikelihoods for the 0-RA using the Durbin-Levinson algorithm for up to n ~ 500,000. The 
M-RA and the 1-RA F had similar computation times, with the latter becoming slightly 
faster for very large n. The log-scores of the M-RA appear to be getting closer to those of 
the (computationally infeasible) 0-RA and 1-RA S with increasing n, while the log-scores of 
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time (min) 



M-RA 

1-RA 



(a) Without nugget 


(b) With nugget 


Figure 5: For the simulation study in two dimensions, comparison of the M-RA (with varying M) to the 
1-RA (with varying r). Note that the time axes are on a log scale. The loglikelihoods are all scaled relative 
to the loglikelihood of the 8-RA. 


the 1-RA F become increasingly worse relative to the optimum. 

For the largest sample size considered (n = 1,966,080), we further investigated computa¬ 
tion times and log-scores for different versions of the M-RA (with r = 30 and M = 2,4, 8) 
and the 1-RA (with r between 60 and 960). The results are shown in Panel (d) of Figure 
4. The 2-RA and the 4-RA were roughly 8.7 and 11.8, respectively, faster than the fastest 
1-RA with an equal or greater log-score. None of the 1-RAs achieved a log-score as high as 
the 8-RA. 

We also conducted a simulation study in two-dimensional space. We considered n = 
3,211,264 observations with an exponential covariance function with scale 0.3 and variance 
1 on a regular grid on the unit square V = [0, l] 2 . Using the function RFsimulate in the R 
package RandomFields (Schlather et ah, 2015), we simulated five datasets without a nugget, 
and five datasets with a nugget consisting of Gaussian white noise with variance 0.05. For 
both settings, we compared the M-RA with r = 49 and M = 2,4,8 to the 1-RA with r 
between 49 and 784. The averaged results are shown in Figure 5. The relative performance 
of the two methods is similar to the one-dimensional case in Figure 4(d), but in the case 
without a nugget even fast approximations with small r or M achieve a relatively high 
loglikelihood. 

4.2 Analysis of total precipitable water 

We also applied our methodology to n = 271,014 measurements of total precipitable water 
(TPW) made by the Microwave Integrated Retrieval System (MIRS) satellites between 2am 
and 3am UTC on February 1, 2011, over a region covering the United States. The measure¬ 
ments are shown in Panel (a) of Figure 6. The data are noisy and hourly datasets exhibit 
large gaps, which means that prediction of the true underlying TPW field is necessary at 
both observed and unobserved locations. Currently, an ad-hoc operational version of such a 
gap-filled product is sent to National Weather Service offices, where it is used to track the 
movement of water vapor in the atmosphere and to detect conditions that can lead to heavy 
precipitation (see Kidder and Jones, 2007, and Forsythe et al., 2012, for more details). 

We extended our methodology slightly to accommodate the fact that the TPW observa- 
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(f) Posterior means for the block approx. (g) Posterior standard deviations for the block approx. 


Figure 6: 271,014 measurements of total precipitable water (TPW), along with posterior predictive means 
and standard deviations of the true underlying TPW field on a 0.25° x 0.25° grid using three different 
methods. Color scales are in units of mm. 
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maximum likelihood estimation 

random 

test regions 


r 

time/lik. 

d 2 

k 

a? 

loglik. 

RMSPE 

CRPS 

RMSPE 

CRPS 

6-RA 

16.45 

98.32 

26.31 

0.74 

0.29 

1.00 

0.56 

0.22 

3.24 

1.72 

1-RA 

264.59 

108.05 

33.21 

0.78 

0.29 

1.01 

0.56 

0.23 

4.06 

2.19 

block 

1054.53 

94.61 

31.22 

0.73 

0.29 

1.00 

0.56 

0.23 

4.51 

2.38 

local 







0.56 

0.22 

3.79 

2.00 


Table 2: Results of the TPW analysis, block: block-independent approximation; local: local kriging based 
on 20 nearest neighbors; f: average number of knots per region; time/lik.: average time (in seconds) per 
likelihood evaluation; loglik.: loglikelihood relative to the 6-RA loglikelihood; random: test set randomly 
sampled from observations; test regions: randomly selected test regions of size 5° x 5°; RMSPE: root mean- 
square prediction error; CRPS: mean continuous rank probability score (lower is better). 


tions contain measurement error. We assumed that the observations were 

zfa) = y M (si) + e(s *), i = l,...,n, 

where ijm (•) is the M-RA as before, and for simplicity we assumed that we have spatially 
independent measurement error, e(sj) ~ iV( 0 ,< 7 ^). In this case, parameter inference and 
prediction (of y(-), not ;?(•)) can proceed as before, except that we needed to set = 

VM{S jl ,...j M ,S ju ... tjM ) + cr 2 e I below (5). 

We compared the proposed M-RA (with M = 6 ) to the 1-RA (full-scale approximation) 
and to a block-independent approximation (e.g., Stein, 2014), which simply divides the 
domain into subregions and treats the process as independent between subregions. This 
can be viewed as a special case of the 1-RA with zero knots at resolution m = 0. The 
6 -RA had varying J m at different resolutions m, (Ji,..., J§) = (2,2,4,8,8,16), with an 
average number of 16.45 knots per region. The 1-RA had 1,024 subregions with an average 
of 264.59 knots per region, and the block approximation had 256 subregions with an average 
of 1054.53 observations per region. After subtracting a constant mean, some exploratory 
analyses showed that a Matern covariance with smoothness parameter 1.5 fit the data well, 
and so all methods used were approximating a covariance of the form, 

Co( s i ; S2) = a 2 M. 1.5(||Si — S2II//C), 


where A4 1.5 is given in (14). 

We first estimated the unknown parameters a 2 , k, and a 2 by numerically maximizing the 
loglikelihood functions of the three approximation methods, and the resulting estimates and 
maximum loglikelihood values are given in Table 2. Then, using the estimated parameters, 
we computed the posterior distribution of the underlying TPW held on a regular 0.25° x 0.25° 
latitude/longitude grid of size 24,805 over the domain. Marginal summaries (posterior means 
and posterior standard deviations) are shown in Figure 6 . 

Finally, we compared the posterior predictive distributions for three sets of 5,000 ran¬ 
domly selected held-out test data (to evaluate short-range predictions) and for three ran¬ 
domly selected held-out test regions of size 5° x 5° (to evaluate long-range predictions). As 
the true TPW values are unknown, we compared the predictions to the observations, consid¬ 
ering the mean-square prediction error (RMSPE) and the mean continuous rank probability 
score (CRPS). The CRPS is a strictly proper scoring rule that quantifies the fit of the entire 
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predictive distribution (i.e., for a normal distribution, the mean and the variance) to the 
data, and it is on the same scale as the observations (see, e.g., Gneiting and Katzfuss, 2014). 

As the block-independent long-range predictions were poor, we also carried out local 
kriging using the parameter estimates from the block-independent approximation. For every 
prediction location, the local-kriging predictions were based only on the data at the 20 
nearest observed locations. The computation times for each test set were between 150 
and 550 seconds. Note also that the M-RA methods provide the joint posterior predictive 
distribution at all prediction locations, while local kriging only provides marginal posterior 
predictive distributions at each prediction location. 

Summarizing the comparison in Table 2, the first three methods have similar compu¬ 
tation times, maximum loglikclihood values, short-range predictions, and they all produce 
slight artifacts in the posterior-standard-deviation plots in the right column of Figure 6 in 
areas with nearly zero uncertainty. However, the 6-RA produces by far the best long-range 
predictions. In the prediction plots in the left column of Figure 6, strong “blocky” artifacts 
are visible for both the 1-RA and the block approximation. These differences are important 
in many satellite-data applications, where large regions of missing data in hourly or daily 
data are very common due to satellite tracks and non-retrieval (e.g., because of heavy cloud 
coverage). 

5 Conclusions and Future Work 

We have presented the multi-resolution approximation (M-RA), a novel technique for ap¬ 
proximating Gaussian processes with any covariance function. The M-RA is essentially a 
linear combination of many spatial basis functions at multiple resolutions. The precision 
matrix of the basis-function weights has a multi-resolutional block-sparse structure, which 
allows scalable inference and distributed computations. Because the basis functions in our 
methodology are chosen optimally for a given covariance function, this can provide further 
insight on other multi-resolution approaches in which basis functions are chosen in a more 
ad-hoc way. 

The M-RA compares favorably with the full-scale approximation of Sang et al. (2011), 
which is a current state-of-the-art method for large spatial data and can be viewed as a 
special case of the M-RA (with M = 1). Using theoretical results, a toy example, large 
simulated datasets, and a real-data application, we have shown that the M-RA can provide 
a better approximation at the same computational complexity and computation time as the 
1-RA, or it can provide a similar approximation at a fraction of the computational time. It 
should also be noted that our inference results for M = 1 provide an algorithm for parallel, 
distributed computations for inference in the full-scale approximation. 

We are planning on providing user-friendly software that provides good default choices 
for the M-RA and that can be run on both desktop computers and on high-performance 
computing environments. Taking advantage of the distributed-memory architecture of the 
latter should in principle allow applying the M-RA to datasets with hundreds of millions of 
observations, as many satellite instruments are now able to produce on a daily basis. 

The M-RA not only approximates the data covariance matrix, but it is a valid Gaussian 
process in its own right. Extensions to more complicated scenarios are therefore possible by 
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embedding the M -RA process in a hierarchical model (e.g., Cressie and Wikle, 2011). When 
the data measurement process is complex, the M -RA can be embedded in a hierarchical 
model that explicitly models the measurement process, and allows, for example, modeling 
non-Gaussian data, or fusing data from different measurement instruments. 

Also of interest is a spatio-temporal version of the M- RA. Because it is possible to store 
and propagate the entire joint posterior predictive distribution, the M-RA could be extended 
to allow Kalman-filter-type inference in massive spatio-temporal state-space models (which 
is challenging for other sparse-precision approaches such as Lindgren et ah, 2011). In this 
sense, the M-RA might also provide an alternative to the ensemble Kalman filter (Evensen, 
1994; Katzfuss et al., 2015) in certain situations. 
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A Proofs 

Proof of Proposition 1. For any set of locations S C T>, y m(S) in the form (3) is a linear combination 
of the vector consisting of all basis-function weights, which has a multivariate normal distribution, and so 
Um(') is a Gaussian process. 

Further, note that um(-) in (1) is a sum of independent components, r m (s) = £ l (J m (s)|^ m (Q^ m ')), 
where S m (-) is independent between regions Starting with 6q(-) = yo(-), we can show iteratively 

for to = 1,..., M — 1 using the law of total variance that, for any finite set C T>j y ...,j m i the matrix 

var(Sm(S ju ..., jm )) = var(5 m ^ 1 {S jlt ...jJ) - var(E(S m -i{ s h,-,jm)\ S rn-i(Qj u ...,j m . 1 ))) 

is nonnegative definite. Thus, the covariance functions of the S m ( •), the r m (-), and of Um(') are nonnegative 
definite. □ 


Proof of Proposition 2. For j- t ,..., j rn = 1 to = 0,1,..., M, and l = 0 ,... ,m, define 
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Note that, for l < M, we have L = H, Jm K , . t 15', P • T,( jm . where is a sparse 

block matrix with the only nonzero block being LG 1 • . Hence, it can be shown that j„^Ji ;; B), „ 
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L ; , +1 S. 1 B* i . Using a variant of the Sherman-Morrison-Woodbury formula, we can also show 
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which proves (13). 

It is easy to see that the desired posterior predictive distribution is multivariate normal, yM(<$ P )|yM(<S) ~ 
and so spatial prediction amounts to finding the posterior mean and covariance matrix, /x and 
respectively. To obtain these quantities, note that we have from well-known properties of the multivariate 
normal distribution that 
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= — — + Efc=0 

By the law of total expectation, we therefore have 

/'v = ^(/4. 2 ) 

= T E.j.y.'bE..) + eL=o & l ji,...,j m r iju—>jk b 


(15) 


= u 


>1,1-1 

i h,-,3m V iu-,3l-l 

^ 1-2 f >l,k 

■,3l-l' M 3U---,3l-l ~*" 


(16) 


and by the law of total covariance, we have 
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l ji,-,3M from ( 15 ) an d 

and iteratively applying (16) and (17) with m = M for l = M,..., 0. 


■ M 


The result (12) follows by starting with //]'.from (15) and M'; 7 .= V 7 ,’. jV L+ . L r. JM , 

□ 
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